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ABSTRACT 

Transcription factors (TFs) do not function alone but 
work together with other TFs (called co-TFs) in a 
combinatorial fashion to precisely control the tran- 
scription of target genes. Mining co-TFs is thus 
important to understand the mechanism of tran- 
scriptional regulation. Although existing methods 
can identify co-TFs, their accuracy depends 
heavily on the chosen background model and 
other parameters such as the enrichment win- 
dow size and the PWM score cut-off. In this study, 
we have developed a novel web-based co-motif 
scanning program called CENTDIST (http:// 
compbio.ddns.comp.nus.edu.sg/~chipseq/centdist/). 
In comparison to current co-motif scanning 
programs, CENTDIST does not require the input of 
any user-specific parameters and background 
information. Instead, CENTDIST automatically de- 
termines the best set of parameters and ranks 
co-TF motifs based on their distribution around 
ChlP-seq peaks. We tested CENTDIST on 14 
ChlP-seq data sets and found CENTDIST is more 
accurate than existing methods. In particular, 
we applied CENTDIST on an Androgen Receptor 
(AR) ChlP-seq data set from a prostate cancer cell 
line and correctly predicted all known co-TFs (eight 
TFs) of AR in the top 20 hits as well as discovering 
AP4 as a novel co-TF of AR (which was missed by 
existing methods). Taken together, CENTDIST, 
which exploits the imbalanced nature of co-TF 
binding, is a user-friendly, parameter-less and 



powerful predictive web-based program for under- 
standing the mechanism of transcriptional co- 
regulation. 

INTRODUCTION 

In order to precisely regulate the expression of target 
genes, transcription factors (TFs) bind to specific short 
stretches of DNA sequences or motifs in our genome. 
Generally, a gene is not regulated by only a single TF, 
but instead by a combination of TFs binding to chromatin 
in close proximity. For example, the Androgen Receptor 
(AR) and the forkhead factor, FoxAl, are co-localized 
together at AR-binding sites (ARBS) to regulate the tran- 
scription of AR-dependent genes in prostate cancer cells 
(1), whereas, Sox2, Oct4 and Nanog all converge together 
at enhanceosomes to control genes involved in 
pluripotency and self-renewal in embryonic stem (ES) 
cells (2). TFs that co-localize and collaborate together 
are known as co-associated TFs (or co-TFs) of each other. 

Identifying co-TFs is an important step in understand- 
ing the mechanism of transcriptional regulation. Recent 
advances in ChlP-seq and the wide adoption of the tech- 
nology in mapping TF-binding sites has allowed research- 
ers to identify novel co-TFs (3). Currently, co-TFs of a 
selected TF are identified in the following manner. First, a 
peak calling program such as MACS (4) or CCAT (5) is 
used to determine which peaks in the ChlP-seq data are 
binding sites. Next, candidate co-TFs are predicted by 
examining if their motifs (position weight matrix, PWM) 
are enriched near the ChlP-seq peaks after normalizing 
against a chosen background model. TFs with enriched 
motifs are classified as potential co-TF candidates and 
subsequently validated experimentally. This approach, 
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known as the enrichment based method, has been widely 
used to identify novel co-TFs in web-based programs such 
as CEAS (6), CORETF (7), ConTra (8) and oPOSSUM 
(9). However, there are occasions when this approach 
fails to find co-TFs. This is because the accuracy of 
enrichment-based methods is highly dependent on 
several user-specific parameters including: (i) the back- 
ground (which models the non-binding sites); (ii) the en- 
richment window size (which models the distance between 
the co-TF and the peak); and (iii) the PWM score (10) 
cut-off (which determines if a site can be bound by the 
co-TF or not). Since different co-TFs require different 
parameters, existing methods can only identify co-TFs 
that satisfy the parameters specified by the user. This re- 
striction thus limits the accuracy of existing methods. To 
avoid this problem, it would be ideal to have a method 
that automatically determines the background and esti- 
mates the enrichment window size as well as the PWM 
score cut-off for every co-TF. 

Recently, several studies showed that if two TFs are 
co-associated, their ChlP-seq peaks (or their binding 
sites) are not only in close proximity with each other, 
but the relative distance of each TF with respect to the 
other exhibits a peak-like distribution (1,2,1 1). We call this 
property the center distribution. Herein, we examine 
whether center distribution can be utilized for co-TF dis- 
covery. Moreover, we have developed a method called 
CENTDIST (http://compbio.ddns.comp.nus.edu.sg/ 
~chipseq/centdist/), which ranks TFs based on their 
center distribution score. Unlike existing enrichment 
based methods, CENTDIST does not require any 
user-specific parameters. It can automatically optimize 
the enrichment window size and the PWM score cut-off. 
Furthermore, CENTDIST can predict weakly or margin- 
ally enriched co-TFs. In term of usability, CENTDIST is 
fast, user-friendly, and capable of handling data sets with 
over a million ChlP-seq peaks. The web-interface of 
CENTDIST also provides useful additional information 
that helps users select the best co-TF candidates. 

We compared the performance of CENTDIST against 
two enrichment-based programs on 1 3 ChlP-seq data sets 
generated for 13 TFs from mouse ES cells (2). Our 
large-scale comparison showed that CENTDIST was the 
best performer amongst the three programs. We also 
applied CENTDIST on an AR ChlP-seq data set 
generated from a prostate cancer cell line. CENTDIST 
was sensitive enough to discover all known co-TFs 
(eight co-TFs) of AR within top 20 hits. Furthermore, 
CENTDIST identified AP4 as a novel co-TF of AR, 
which was not found by traditional enrichment-based 
methods. Taken together, CENTDIST is a powerful and 
user-friendly tool for studying the mechanism of TF 
co-regulation. 

METHODOLOGY AND RESULTS 

Imbalanced distribution of TF motifs around 
ChlP-seq peaks 

Accurately predicting all the co-TFs of a particular TF 
from a ChlP-seq experiment is computationally 



challenging because some co-TFs may occur infrequently 
while the location of others are less certain than that of the 
ChlPed TF (Chromatin Immuno-precipitated TF in 
ChlP-seq experiment). Previous reports suggested that 
motifs of co-TFs are enriched around ChlP-seq peaks 
after normalizing against a particular background model 
(12,13). However, it is often difficult to choose the correct 
background model. Furthermore, it is also not easy to 
select the correct PWM score cut-off or the proper enrich- 
ment window size of co-TFs. 

Instead of asking users to specify these parameters, we 
have developed a new program called CENTDIST which 
requires minimal input from users. Users only need to 
enter a set of genomic locations representing ChlP-seq 
peaks (chromosome-peak summit position) and a list of 
candidate PWM motifs [provided by users or obtained 
from either the TRANSFAC (14) or JASPAR (15) data- 
bases] representing co-TF-binding sites. Based only these 
two pieces of information, CENTDIST will compute the 
distribution of motif occurrences with respect to the peak 
summit (frequency graph) for each PWM motif under dif- 
ferent PWM score cut-offs. CENTDIST will then find an 
optimal set of parameters that maximizes the frequency 
score (see below) near the ChlP-seq peaks. Given a fre- 
quency graph, and an enrichment window size d, we 
denote m t and m 0 to be the frequencies within and 
outside the enrichment window, respectively. The fre- 
quency score is defined as: 

7 - 7( m ° + ^ 

-^Frequency — ^ I ™U , , JTlfvYYl 0 I 
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where Z(x, p, n) = {x — np)j sqrt(np{\ — p)) is the normal 
approximation of the binomial Z-score for observing x 
successes out of n trials where the probability of a 
success trial is p. For example, we examined an AR 
ChlP-seq data set that was generated in our lab from 
the prostate cancer cell line, LNCaP. As shown in 
Figure la, the AR motif (RGAACANNNTGTTCY) 
occurs much more frequent near the center of the AR 
peaks, when compared to the flanking regions. Thus, the 
AR motif is imbalancely distributed and would be con- 
sidered as having a good frequency score. 

However, there are occasions when noise (like CG/AT 
bias) could also be imbalancely distributed around 
ChlP-seq peaks. Although such noise may be enriched, 
we expect it will not change dramatically near the center 
of ChlP-seq peaks compared to flanking regions. 
Therefore, to account for noise in the data, we included 
a function called the velocity score. The velocity score is 
derived from a velocity graph of the co-TF motif 
(Figure lb) which is generated from the slope of the fre- 
quency graph (Figure la). If noise is assumed to change 
slowly (or linearly), the velocity of noise will be near zero; 
otherwise, it will change dramatically near to the peaks as 
compared to the flanking regions. Specifically, the velocity 
score is a Z-score which measures if the velocity is changed 
dramatically. Similar to frequency score, given a vel- 
ocity graph and an enrichment window size d, we de- 
note the positive and negative velocity within the 
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Figure 1. AR motif distribution analysis of our AR ChlP-seq data set. (a) Frequency graph of the AR motif in an AR ChlP-seq data set. 
(b) Velocity graph of the AR motif in an AR ChlP-seq data set. In each graph, the dotted line partitions the distribution into the enriched 
region (left region) and the flanking region. The generation of these two graphs can be found in Supplementary Section 1.1. 



window m j+ , m t _ and outside the window m a+ , m 0 _ and 
the velocity score is defined as: 

-7 -?(\ I.. I (\mi-\+\m 0+ \) \ 
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where M—(\ m i+ | + 1 m Q _ \ + \ m i+ \ + | w 0 _ |). 

In short, CENTDIST will also take into consideration the 
velocity distribution of motif occurrences (velocity graph), 
which will correct the frequency score biases due to CG 
(or AT) variation in the regions around the ChlP-seq 
peaks. The scoring function used by CENTDIST to 
assess motif distribution is called the center distribution 
score (implementation details can be found in 
Supplementary Section 1), which is the sum of two com- 
ponents: frequency score and velocity score. For example, 
we observed a dramatic change in velocity (or slope) for 
the AR motif in the enriched region of the AR ChlP-seq 
peaks while the velocity remained constant in the flanking 
region (Figure lb). In such instance, the AR motif would 
be classified as having a good velocity score. 

Figure 2 demonstrates the capability of CENTDIST to 
promote true positive and repress false positive. To dem- 
onstrate the former, we consider the motif occurrence of 
V$AR_02 around AR ChlP-seq peaks. As shown in 
Figure 2a, the Z-score progressively increases as we use 
flanking region as background (instead of promoter or 
random region), select the optimal window, select 
the optimal PWM cut-off and finally considering the 
velocity. To demonstrate the latter, we study the 
CG-rich yeast TF motif, FSADR101, which would 
have been determined incorrectly to be enriched around 
the Pol II (RNA polymerase II) ChlP-seq peaks in human 
K562 cells (16) using traditional approach. We know this 
motif is not actually enriched because Pol Il-binding sites 
are enriched for CpG islands, which are regions known to 
contain many CG repeats. As shown in Figure 2b, this 



motif has a modest center distribution score based on 
only the frequency score, but the final center distribution 
score was significantly lower after taking the velocity score 
into consideration. 

Verification of CENTDIST on a large scale ChlP-seq 
experiment 

To determine if CENTDIST can identify co-TFs better 
than existing enrichment based methods, we compared 
the performance of CENTDIST with two motif 
scanners, CORE_TF (http://grenada.lumc.nl/ 

HumaneGenetica/CORE_TF/) and CEAS (http://liulab 
.dfci.harvard.edu/CEAS/) (6,7). We chose CORETF 
and CEAS for our comparisons because they were the 
only programs we could find that report enriched motifs 
from user-defined genomic regions while other programs 
were limited to promoter regions only. For our compari- 
sons, each program was optimized to their best perform- 
ance (Supplementary Section 2.4). 

Recently, 13 TF ChlP-seq maps were generated from 
mouse ES cells (2). These 13 TFs were shown to cluster 
into two core transcriptional modules called MTLs 
(multiple TF-binding loci). Because numerous co-TF rela- 
tionships were discovered from the 13 factors, we decided 
to use these data sets for our comparisons of the three 
motif scanners. Only genomic locations of the ChlP-seq 
peaks and motifs from the TRANSFAC database were 
entered into CENTDIST. For CORE TF and CEAS, 
input sequences with different window size (±100, ±200 
and ±500 bp) around the summit of the ChlP-seq peaks 
were extracted and different background settings were 
tested. The results from each program were compared 
against a table containing the co-TF motifs for each of 
the 13 ES TFs (Supplementary Table 7). 

We assessed the performance of each program by the 
area under the receiver operating characteristic (ROC) 
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Figure 2. Demonstration of CENTDIST Capability, (a) CENTDIST enhances the Z-score of the AR motif in the AR ChlP-seq data set (LNCaP 
cell line). The blue bar and red bar show the Z-scores of the AR motif computed using the traditional enrichment method under the window size of 
200 bp and the default PWM cut-off (1.32), respectively. The green bars show the Z-score of the AR motif computed by CENTDIST after it 
optimized different parameters. In the initial stage, the frequency Z-score was calculated using flanking regions at 200 bp as background and default 
PWM cut-off. In the second stage (window size), CENTDIST finds the best window size to maximize the Z-Score, in which the enrichment window 
size of AR is changed from 200 to 40 bp. In the third stage (cut-off), CENTDIST finds the best PWM cut-off to maximize the Z-Score, which leads 
to the flanking region noise level dropping significantly. In the fourth stage, CENTDIST combines the Z-scores of both the frequency graph and the 
velocity graph, thus further increasing the Z-Score. (b) CENTDIST can repress the Z-score of the false CG-rich motif in the Pol II ChlP-seq data set 
compared to the traditional overrepresentation methods. All Z-scores are computed exactly as in (a). Since CENTDIST considers the velocity graph 
of the false CG-rich motif, the combined Z-score of CENTDIST finally drops and is significantly lower than that computed by the traditional 
enrichment based method. As a side note, this figure also showed that random background can produce quite different results compared to promoter 
background, which highlights the difficulty of choosing a correct background in existing enrichment based methods. 
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Figure 3. Co-motif analysis of 13 ES cell TFs using CENTDIST, CEAS and CORE_TF. (a) A comparison of co-motif analysis results using 
CENTDIST, CORETF and CEAS on 13 different ChlP-seq data sets from ES cell. The best setting in each data set for CORETF and CEAS were 
used for comparison. CENTDIST* = CENTDIST algorithm without the inclusion of velocity score, (b) Heat map representing the analysis of 1 1 ES 
cell core TFs motif enrichment in 13 ChlP-seq experiments. Every row corresponds to a PWM motif while every column corresponds to a ChlP-seq 
data set. The color of each entry presents the Z-score (in log scale) of the motifs with respect to the peaks of the ChlP-seq data set. The figure 
showed that the enhancer motifs are enriched in the enhancer ChlP-seq data sets (top-left gene rectangle) while the promoter motifs are enriched in 
the promoter ChlP-seq data sets (bottom-right green rectangle). 



curve (AUC) (17), which ranges from 0 to 1 (a score of 0.5 
is equivalent to random guessing). The details of how 
AUC scores were calculated can be found in 
Supplementary Section 2.6. Based on AUC scores, our 
results showed that CENTDIST significantly outper- 
formed the best result from both CEAS and CORE TF 
(Figure 3a and Supplementary Table S3). We noticed that 
for CEAS and CORE TF, different configurations led to 
different performances, which highlights the difficulty in 
selecting the appropriate parameters for co-motif analysis 
since no single set of parameters can be considered the best 
for each ChlP-seq data set. CENTDIST, which requires 
neither background nor other parameter settings, per- 
formed significantly better (average AUC score = 0.905) 
than the best configuration of CEAS (average AUC 
score = 0.740) or CORETF (average AUC 
score = 0.84084). Furthermore, we compared the results 
of CENTDIST with the results ranked by frequency score 
only (denoted as CENTDIST* in Figure 3a and 



Supplementary Table S3). Overall, we found 
CENTDIST was better than CENTDIST* in 1 1 out of 
13 experiments. 

Next, we examined the center distribution scores of 1 1 
ES TF motifs (Smadl and p300 do not have known motif) 
across 13 TF ChlP-seq data sets (Figure 3b). From this 
analysis, we clearly saw two functional groups: the 
enhancer motifs (Oct4, Sox2, Nanog and Stat3) have 
good center distribution score in the enhancer TF 
ChlP-seq data sets (top-left green box), while the 
promoter motifs (cMyc, nMyc, Zfx and E2fl) have good 
center distribution score in the promoter TF ChlP-seq 
data sets (bottom-right green box) (Figure 3b). These 
results are in agreement with our previous findings (2). 
Moreover, enhancer motifs did not show good center dis- 
tribution in the promoter ChlP-seq data sets, and vice 
versa. The only exception was Stat3, which was classified 
as an enhancer TF but had good center distribution at the 
promoter. However, a recent report showed that Stat3 was 
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Figure 4. AP4 is a novel co-TF of AR. (a) ChlP-qPCR of AP4 was performed on 22 randomly selected AR peaks containing AP4 motifs in LNCaP 
cells before and after 2 h of DHT treatment, (b) Venn diagram depicting the overlap between the ChlP-seq peaks of AR and AP4. (c) AP4 ChlP-seq 
peak distribution around AR ChlP-seq peak, (d) Association of up-regulated genes with binding sites containing AR + AP4, AR only, or others. 



also enriched in the promoter regions of ES cells, suggest- 
ing this TF can be located at both promoter and enhancer 
regions (18). In conclusion, the results from this large- 
scale comparison study demonstrate that center distribu- 
tion is a good statistical model for predicting the 
occurrences of co-TF motifs from ChlP-seq data. 

Identification of novel co-associated factors of AR in 
prostate cancer cells by CENTDIST 

AR is a member of the nuclear hormone receptor super- 
family that is important in the progression of prostate 
cancer (19). Recent global analyses of AR in the prostate 
cancer cell line, LNCaP, revealed several co-TFs 
(e.g. FoxAl, Octl, Etsl, etc.) that collaborate with AR in 
mediating androgen-dependent transcription (1). As an in- 
dependent assessment of CENTDIST performance with 
respect to CEAS and CORE_TF, and also to discover 
potential new co-TFs of AR, we compared the results of 
the three programs on our AR ChlP-seq data set from 
LNCaP cells. Again our results showed CENTDIST was 
the best performer among the three programs. CENTDIST 
identified AR and all seven known co-TFs of AR within the 
top 20 hits (first two columns in Supplementary Table S4). 
This result was significantly better than CEAS, which failed 
to find five of the known AR co-TFs. CORE TF, 
optimized with a random background setting and 400-bp 
extracted-window size, identified all known AR co-TFs, 
however, this was within the top 37 hits. AUC analysis 



also indicated that CENTDIST outperformed the other 
motif scanners even under their best configurations 
(Supplementary Table S4). Furthermore, we also examine 
the shape of the frequency and the velocity graphs of the 
known co-TF motifs of AR (Supplementary Figure S3). 
We found that all of them have good shape even though 
their enrichment were not as significant as that of AR. 
Taken together, these results suggest that the frequency 
and velocity scores of co-motifs are useful information 
for determining the true motif signals. 

Next, we validated a co-TF predicted by CENTIST. We 
chose a co-TF that was predicted by CENTDIST but not 
by the other programs. Specifically, we selected the AP4 
motif, which was ranked 21st by CENTDIST. AP4 
belongs to the basic helix-loop-helix (bHLH) family of 
TFs. It functions as a homodimer and is known to play 
important roles in colorectal cancer (20), however our 
understanding of this TF in prostate cancer is limited. 
To test if AP4 is a co-TF of AR, we randomly selected 
22 AR ChlP-seq peaks that contain the AP4 motif and 
performed ChlP-qPCR in LNCaP cells treated with and 
without DHT. As shown in Figure 4a, all 22 binding sites 
showed enrichment compared to the genomic control site, 
suggesting that AP4 is co-localized at ARBS. 
Furthermore, under DHT treatment (which recruits 
AR), the binding of AP4 was enhanced compared to 
vehicle (Ethanol) treatment. To further validate whether 
AP4 and AR are co-binding, we took an unbiased 
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Figure 5. CENTDIST web interface and program procedure. Users can input or upload ChlP-seq peak locations or the bed format peak-region 
data, and select the corresponding reference genome and the motif candidates (TRANSFAC, JASPAR or custom database). After submitting the 
job, the data will automatically be processed according to the CENTDIST analysis pipeline. Specifically, CENTDIST will scan the sequences 
(±1000 bp around the peaks) and obtain the occurrences of each PWM motif to generate the frequency graph and the velocity graph. Z-score is 
used to assess the enrichment around peaks for each graph. The center distribution score of each PWM motif is calculated as the sum of the two 
Z-scores. Finally, CENTDIST outputs a list of TF families ranked by the center distribution scores. 



approach and performed a ChlP-seq of AP4. As shown in 
Figure 4b, a large number (2296 out of 6082/38%) of AP4 
ChlP-seq peaks overlapped with AR. A distribution 
analysis of AP4 ChlP-seq peaks around ARBS confirmed 
that AP4 binds in close proximity (within ± 200 bp) to AR 
(Figure 4c). We scanned for the AP4 motif in the ChlP-seq 
peaks and found that 79.4% of the AR-AP4 overlapping 
peaks contain AP4 motif. In contrast, 40.8% of the AR 
only peaks contain AP4 motif, however, the center distri- 
bution score for the AP4 motif around these peaks was 
low (Supplementary Figure S5). 

Finally, we examined the fraction of androgen 
up-regulated genes (Supplementary Section 2.5) near AR 
and AP4 peaks. We divided the genes into three groups: 
genes with AR + AP4 peaks, genes with AR only peaks, 
and genes with no AR peaks. We found that the propor- 
tion of up-regulated genes in group 1 is 1.6-fold and 
3.7-fold more than that in groups 2 and 3, respectively 
(Figure 4d), suggesting that AP4 may co-localize with 
AR to directly up-regulate the transcription of androgen 
target genes. Taken together, our results show that 
CENTDIST can identify novel co-TFs, even ones that 
are ranked very low by enrichment based methods. 



CENTDIST web server implementation 

Based on our observation on the unbalanced distribution 
of co-TF motifs around ChlP-seq peaks, we developed a 
web application called CENTDIST for identifying co-TFs 
from ChlP-seq experiments. The general pipeline of 
CENTDIST is shown in Figure 5. 

CENTDIST is designed for the analysis high- 
throughput ChlP-seq data. Its web-user interface 
contains three main parts: input, job management and 
output. For input, CENTDIST accepts a list of 
ChlP-seq peaks. The ChlP-seq peak information can be 
formatted in the form of chromosome-position pairs or 
BED format genomic regions. CENTDIST is capable of 
supporting >1 million peak coordinates. The motifs used 
for scanning can be entered in the form of PWM or 
selected from either the TRANSFAC database (version 
6.0), which contains 849 matrices or the JASPAR 
database, which has 459 matrices. CENTDIST also 
provides options for users to easily filter PWM motif can- 
didates by taxonomy, or TF family. Finally, unlike other 
motif scanning programs, CENTDIST is totally param- 
eter free. Users are not required to provide the 
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background, the enrichment window size or even choose 
the FDR or PWM cut-off for the PWM motifs. All these 
parameters will be estimated by CENTDIST 
automatically. 

With regards to job management, submitted jobs will be 
sent to the job queue on the server and processed based 
on a first come first serve policy. Users can view the status 
of their submitted jobs, and access or delete the results 
of previous runs at the 4 Job Management' page 
(Supplementary Section 5.4). The page refreshes automat- 
ically and email notifications will be sent to users once the 
jobs are completed. 

The main output page for CENTDIST is a table con- 
taining PWM motifs ranked according to center distribu- 
tion scores (Supplementary Section 5.5). Each row in the 
table presents the enriched TF family, and user can click 
on a link associated with each TF family to browse the 
result of each individual member. The output also 
contains visualization features like the PWM logo (21) 
of the motif, the frequency graph (center view and 
folding view), and other useful numeric features like 
binding range (the enrichment window size), PWM score 
threshold (the cut-off that maximizes the center distribu- 
tion score), center distribution score and P-value. In 
addition, the output page provides the motif distribution 
across different peak ranks (columns 7 and 8 in 
Supplementary Figure SI 2), which is useful when the 
input peaks are sorted by some quality measure like 
ChlP-seq intensity. 



DISCUSSION 

The performance of existing motif scanners is heavily de- 
pendent on selecting the proper background and other 
parameter settings. Choosing the correct background, 
however, is currently considered an art. What is more, 
there is no one set of parameters that can satisfy all 
co-TFs. Finally, the assumption that noise is uniformly 
distributed may not be true when CG (or AT) content 
varies in ChlPed enriched regions. In this article, we 
present a new computational method called CENTDIST 
that utilizes frequency information as well as slope infor- 
mation (velocity) to predict whether a motif is real or not. 
CENTDIST does not require an explicit background 
model. Using the velocity score, CENTDIST is also in- 
sensitive to CG- or AT-biases. We examined CENTDIST 
on 14 ChlP-seq data sets and demonstrated that it is better 
than existing methods. We also show that this can be 
achieved without requiring expert knowledge in 
configuring the program. For the ChlP-seq of in LNCaP 
cells, CENTDIST discovered AP4 as a novel co-TF of 
AR, which was missed by existing enrichment-based 
methods. CENTDIST also discovered nine additional 
co-TF motifs that were unique to the program. For five 
of these co-TF motifs, evidence from literature suggests 
that they could be the motifs of potential collaborators 
of AR (Supplementary Table S5). 

CENTDIST does have certain limitations. For example, 
CENTDIST may fail to identify co-TFs whose binding 
site distribution does not follow the proximity assumption 



(i.e. co-TFs that are not co-localized with the ChlPed TF). 
However, the latter would not be found by traditional 
enrichment-based methods either since their binding sites 
are not enriched. 

CENTDIST is a user-friendly web-based application 
that is capable of analyzing large-scale ChlP-seq data 
sets. It can scan ~700 TRANSFAC motifs over a 
ChlP-seq data set containing 10000 peaks in only 
lOmin (Supplementary Table S9). With CENTDIST, 
users do not have to set any parameters except to 
upload the ChlP-seq peak locations and select the PWM 
motif library they wish to use for scanning. The output of 
CENTDIST contains clean and rich information for users. 
Specifically, it groups the list of enriched motifs into TF 
families, and provides other information including PWM 
logo, motif distribution graph, enrichment P-value and 
the enriched window size of the enriched motifs. 

To the best of our knowledge, CENTDIST is the first 
motif scanner for ChlP-seq data that utilizes the shape 
information (velocity) of the motif distribution, and auto- 
matically detects the size of the motif-enriched region and 
PWM score cut-off. It compares the enrichment inside/ 
outside of the enriched motif region without the input of 
additional background information. Although there is still 
room to improve the methodology, this study opens a new 
door for utilizing the shape information to extract bio- 
logically meaningful co-TFs in a ChlP-seq data set. 
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